Tutorial 10 - Clustering¶
Lecture and Tutorial Learning Goals:¶
After completing this week's lecture and tutorial work, you will be able to:
- Describe a case where clustering would be an appropriate tool, and what insight it would bring from the data.
- Explain the K-means clustering algorithm.
- Interpret the output of a K-means cluster analysis.
- Perform K-means clustering in R.
- Visualize the output of K-means clustering in R using a coloured scatter plot.
- Identify when it is necessary to scale variables before clustering, and do this using R.
- Use the elbow method to choose the number of clusters for K-means.
- Describe advantages, limitations and assumptions of the K-means clustering algorithm.
This worksheet covers parts of the Clustering chapter of the online textbook. You should read this chapter before attempting the worksheet.
### Run this cell before continuing.
library(tidyverse)
library(tidymodels)
library(tidyclust)
library(repr)
library(GGally)
options(repr.matrix.max.rows = 6)
source('tests.R')
source("cleanup.R")
1. Pokemon¶
We will be working with the Pokemon dataset from Kaggle, which can be found here. This dataset compiles statistics on 721 Pokemon. The information in this dataset includes Pokemon name, type, health points, attack strength, defensive strength, speed points, etc. These values describe a Pokemon's abilities (higher values are better). We are interested in whether there are sub-groups/clusters of Pokemon based on these statistics, and if so, how many.

Source: https://media.giphy.com/media/3oEduV4SOS9mmmIOkw/giphy.gif
Question 1.0
{points: 1}
Use read_csv to load pokemon.csv from the data/ folder.
Assign your answer to an object called pokemon_full.
# your code here
#fail() # No Answer - remove if you provide an answer
pokemon_full <- read_csv("data/pokemon.csv")
Question 1.1
{points: 1}
To start exploring the Pokemon data, create a scatter plot matrix (or pairplot) using ggpairs. The plot should only contain the columns Total to Speed from pokemon_full. You can check the data wrangling chapter in the textbook to recall how to select a range of columns using select with :.
Assign your answer to an object called pokemon_pairs. Make sure to set a suitable size for the plot.
options(repr.plot.height = 12, repr.plot.width = 12)
# your code here
#fail() # No Answer - remove if you provide an answer
pokemon_pairs <- pokemon_full |>
    select(Total:Speed) |>
    ggpairs(aes(alpha = 0.05)) +
    theme(text = element_text(size = 20))
pokemon_pairs
Question 1.2
{points: 1}
From the pairplot above, it does not look like the Pokemon separate into clear groups in any of the pairwise scatter plots. Here, we will continue exploring the relationship between Speed and Defense and see what happens if we try to cluster the data points on these two variables, even though no visually discernible clusters appear in the chart.
First, select the columns Speed and Defense, creating a new dataframe with only those columns.
Assign your answer to an object named pokemon.
# your code here
#fail() # No Answer - remove if you provide an answer
pokemon <- select(pokemon_full, Speed, Defense)
pokemon
Question 1.3
{points: 1}
Next, create a scatter plot of only these two variables so that we can look closely at their relationship. Put the Speed variable on the x-axis, and the Defense variable on the y-axis.
Assign your plot to an object called pokemon_scatter. Don't forget to do everything needed to make an effective visualization, including setting an appropriate alpha value of the points.
# your code here
#fail() # No Answer - remove if you provide an answer
options(repr.plot.height = 7, repr.plot.width = 7)
pokemon_scatter <- pokemon |>
    ggplot(aes(x = Speed, y = Defense)) +
    geom_point(alpha = 0.35) +
    labs(x = "Speed (Base modifier)", y = "Base Damage Resistance")
pokemon_scatter
Question 1.4.1
{points: 3}
The chart above confirms what we saw in the pairplot; there doesn't seem to be visually distinct clusters of points in these two dimensions. Could it still be informative to run clustering with this data? Let's find out by using K-Means to cluster the Pokemon based on their Speed and Defense.
So far when using K-Means, we have scaled our input features. Will it matter much for our clustering if we scale our variables for the pokemon data? Is there any argument against scaling here?
I think it could still be informative to run clustering on this data. Even without visually obvious groups, the cluster assignments can reveal structure; for example, we could check whether certain Pokemon types tend toward a particular stat profile.
Scaling matters whenever the variables are measured on different scales, because K-means relies on straight-line distance and the variable with the larger numeric range will dominate that distance. Here, however, Speed and Defense are both base stat points on roughly the same scale, so scaling should change the result relatively little.
One argument against scaling is interpretability: after standardization, the cluster centres are no longer in the original stat units (e.g., you can no longer read off the maximum attack directly), which makes them harder to interpret.
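To see why scaling matters less when variables share a scale, here is a minimal base-R sketch (toy stat values, assumed purely for illustration) comparing pairwise distances before and after standardizing:

```r
# Toy data (assumed values): two variables on very different numeric ranges.
stats <- data.frame(Speed = c(45, 120, 60),
                    HP    = c(10000, 25000, 14000))

# Unscaled: HP dominates every pairwise distance because its range is far larger.
d_raw <- dist(stats)

# Standardized: both variables contribute comparably to each distance.
d_scaled <- dist(scale(stats))
```

With Speed and Defense both measured in base stat points of similar magnitude, the raw and scaled distance matrices would differ far less, which is the argument that scaling matters less for this particular pair of variables.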
Question 1.4.2
{points: 1}
Now, let's use K-means to cluster the Pokemon based on their Speed and Defense variables.
- Create a recipe named pokemon_recipe that standardizes the data.
- Create a model specification named pokemon_spec for K-means clustering with 4 clusters.
- Fit the model using a tidymodels workflow; call the output of the fit() function pokemon_clustering.
Assign your answers to objects called pokemon_recipe, pokemon_spec, and pokemon_clustering.
Note: We set the random seed here because K-means initializes observations to random clusters.
#DON'T CHANGE THE SEED VALUE BELOW!
set.seed(2019)
pokemon_recipe <- recipe(~., data = pokemon) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
pokemon_spec <- k_means(num_clusters = 4) |>
    set_engine("stats")
pokemon_clustering <- workflow() |>
    add_recipe(pokemon_recipe) |>
    add_model(pokemon_spec) |>
    fit(data = pokemon)
# your code here
#fail() # No Answer - remove if you provide an answer
pokemon_clustering
Question 1.5
{points: 1}
Let's visualize the clusters we built in pokemon_clustering. Use the augment function and create a coloured scatter plot of Speed (x-axis) vs Defense (y-axis) with the points coloured by their cluster assignment.
Name this plot pokemon_clustering_plot.
# your code here
#fail() # No Answer - remove if you provide an answer
aug <- augment(pokemon_clustering, pokemon)
pokemon_clustering_plot <- aug |>
    ggplot(aes(x = Speed, y = Defense, colour = .pred_cluster)) +
    geom_point(alpha = 0.62) +
    labs(x = "Speed (Base modifier)", y = "Base Damage Resistance", colour = "Cluster Type")
pokemon_clustering_plot
Question 1.6
{points: 3}
Below you can see multiple initializations of K-means with different seeds for K = 4. Can you explain what is happening, and how we can mitigate it in the kmeans function?

K-means initializes the cluster centres randomly, so different values in set.seed produce different starting assignments. Depending on where the algorithm starts, it can converge to different local minima of the total within-cluster sum of squares, which is why the final cluster assignments differ between runs even though the number of clusters stays fixed at 4.
We can mitigate this with the nstart argument (e.g., set_engine("stats", nstart = 10)): the algorithm is run from many random initializations and the solution with the lowest total within-cluster sum of squares is kept, giving much more consistent results.
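The same idea can be demonstrated directly with the underlying stats::kmeans function. This is a hedged sketch on toy data (the generated points and exact WSSD values are illustrative only):

```r
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)  # 100 toy points in 2 dimensions

# A single random initialization: the final total WSSD depends on where the
# centres happened to start, so reruns with other seeds can land on
# different local minima.
w_single <- kmeans(x, centers = 4, nstart = 1)$tot.withinss

# Ten random initializations: kmeans keeps the best (lowest-WSSD) of the ten
# runs, so the result is far more stable across seeds.
w_multi <- kmeans(x, centers = 4, nstart = 10)$tot.withinss
```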
Question 1.7
{points: 1}
We know that comparing how the WSSD varies for multiple values of $K$ is an important step of selecting a suitable clustering model. That's what we will do next!
For this exercise, you will calculate the total within-cluster sum-of-squared distances for $K$ = 1 to $K$ = 10.
- Create a tibble with the desired values of $K$.
- Create a new model specification that sets nstart to 10 and tells k_means you want to tune the number of clusters.
- Create a new workflow that uses tune_cluster to tune the number of clusters.
- Use the collect_metrics function to collect the results.
- Use the filter, select, and mutate functions to construct a tibble with two columns named num_clusters and total_WSSD. Store that tibble in an object named elbow_stats.
Assign your answer to a tibble object named elbow_stats. It should have the columns num_clusters and total_WSSD.
set.seed(2020) # DO NOT REMOVE
# your code here
#fail() # No Answer - remove if you provide an answer
k_vals <- tibble(num_clusters = 1:10)
spec_tune <- k_means(num_clusters = tune()) |>
    set_engine("stats", nstart = 10)
kmeans_tuning_stats <- workflow() |>
    add_recipe(pokemon_recipe) |>
    add_model(spec_tune) |>
    tune_cluster(resamples = apparent(pokemon), grid = k_vals) |>
    collect_metrics()
elbow_stats <- kmeans_tuning_stats |>
    filter(.metric == "sse_within_total") |>
    mutate(total_WSSD = mean) |>
    select(num_clusters, total_WSSD)
elbow_stats
Question 1.8
{points: 1}
Let's visualize how WSSD changes as we vary the value of $K$. To do this, create the elbow plot. Put the within-cluster sum of squares on the y-axis, and the number of clusters on the x-axis.
Assign your plot to an object called elbow_plot.
# your code here
#fail() # No Answer - remove if you provide an answer
elbow_plot <- elbow_stats |>
    ggplot(aes(x = num_clusters, y = total_WSSD)) +
    geom_line() +
    geom_point() +
    xlab("K value") +
    ylab("Total Within Cluster \n Sum of Squares") +
    scale_x_continuous(breaks = seq(0, 10, by = 1))
elbow_plot
Question 1.9
{points: 3}
Based on the elbow plot above, what value of $K$ would you choose? Explain why.
I would pick K = 3.
Beyond K = 3, increasing K decreases the total WSSD by only an insignificant amount, so the extra clusters add complexity without much benefit. Too large a K also risks over-clustering, i.e., splitting natural groups into arbitrary pieces.
If we chose K = 1 or 2, too many dissimilar points would be grouped together, which would obscure meaningful structure in the data.
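One way to make the "diminishing returns" argument concrete is to compute the relative drop in total WSSD from each $K$ to the next. The WSSD values below are hypothetical, assumed purely for illustration (not taken from the actual elbow_stats output):

```r
# Hypothetical total WSSD for K = 1..5 (assumed values, not from the data)
wssd <- c(1960, 1366, 1116, 1008, 944)

# Fractional decrease in WSSD going from each K to K + 1
rel_drop <- -diff(wssd) / head(wssd, -1)
round(rel_drop, 2)  # 0.30 0.18 0.10 0.06
```

The drops shrink quickly after the first couple of steps; the elbow is the value of $K$ after which each additional cluster buys only a small fractional decrease.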
Question 1.10
{points: 3}
Using the value that you chose for $K$, perform the K-means algorithm, set nstart = 10 and assign your answer to an object called pokemon_final_kmeans.
Augment the data with the final cluster labels and assign your answer to an object called pokemon_final_clusters.
Finally, create a plot called pokemon_final_clusters_plot to visualize the clusters. Include a title, colour the points by the cluster and make sure your axes are human-readable.
set.seed(2019) # DO NOT REMOVE
# your code here
#fail() # No Answer - remove if you provide an answer
pokemon_final_kmeans <- k_means(num_clusters = 3) |>
    set_engine("stats", nstart = 10)
pkm_kmeans <- workflow() |>
    add_recipe(pokemon_recipe) |>
    add_model(pokemon_final_kmeans) |>
    fit(data = pokemon)
pokemon_final_clusters <- augment(pkm_kmeans, pokemon)
pokemon_final_clusters_plot <- pokemon_final_clusters |>
    ggplot(aes(x = Speed, y = Defense, colour = .pred_cluster)) +
    geom_point(alpha = 0.62) +
    labs(x = "Speed (Base modifier)", y = "Base Damage Resistance", colour = "Cluster Type") +
    ggtitle("K-means clusters (K = 3) of Pokemon by Speed and Defense")
pokemon_final_clusters_plot
Question 1.11
{points: 3}
This looks perhaps a bit better than when we used $K=4$ clusters originally, but is it really a lot better? Use this plot and the elbow plot from Question 1.8 to reason about what might be going on here.
I don't think this plot is much better than the K = 4 plot.
The elbow plot in Question 1.8 shows diminishing returns past K = 3 (in terms of total WSSD). Higher values of K do lower the WSSD, but only by sub-dividing the data into more, smaller clusters rather than revealing genuinely distinct groups. Since the original scatter plot showed no visually separated groups, K-means is essentially partitioning one continuous cloud of points, so neither K = 3 nor K = 4 produces clearly "correct" clusters.
2. Tourism Reviews¶
Source: https://media.giphy.com/media/xUNd9IsOQ4BSZPfnLG/giphy.gif
The Ministry of Land, Infrastructure, Transport and Tourism of Japan is interested in knowing the type of tourists that visit East Asia. They know the majority of their visitors come from this region and would like to stay competitive in the region to keep growing the tourism industry. For this, they have hired us to perform segmentation of the tourists. A dataset scraped from TripAdvisor is provided to you.
This dataset contains the following variables:
- User ID : Unique user id
- Category 1 : Average user feedback on art galleries
- Category 2 : Average user feedback on dance clubs
- Category 3 : Average user feedback on juice bars
- Category 4 : Average user feedback on restaurants
- Category 5 : Average user feedback on museums
- Category 6 : Average user feedback on resorts
- Category 7 : Average user feedback on parks/picnic spots
- Category 8 : Average user feedback on beaches
- Category 9 : Average user feedback on theaters
- Category 10 : Average user feedback on religious institutions
Question 2.0
{points: 3}
Load the data set from https://archive.ics.uci.edu/ml/machine-learning-databases/00484/tripadvisor_review.csv and clean it so that only the Category # columns are in the data frame (i.e., remove the User ID column).
Assign your answer to an object called clean_reviews.
# your code here
#fail() # No Answer - remove if you provide an answer
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00484/tripadvisor_review.csv",
              "data/data.csv")
clean_reviews <- read_csv("data/data.csv") |>
    select(-`User ID`)
clean_reviews
cell-6050fab33fa36da5
Score: 3.0 / 3.0 (Top)
test_that('Did not create an object called clean_reviews', {
expect_true(exists("clean_reviews"))
})
# The remainder of the tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.
### BEGIN HIDDEN TESTS
test_that('clean_reviews is not a tibble', {
expect_true('tbl' %in% class(clean_reviews))
})
test_that('clean_reviews does not contain the correct data', {
expect_equal(dim(clean_reviews), c(980, 10))
expect_false('User ID' %in% colnames(clean_reviews))
expect_equal(digest(int_round(sum(clean_reviews$`Category 1`), 2)), '2885dff545169dfd21c255f972530c43')
expect_equal(digest(int_round(sum(clean_reviews$`Category 10`), 2)), '0a3dc4a87827d808b12aaf765f7823e2')
})
print("Success!")
### END HIDDEN TESTS
Question 2.1
{points: 3}
Perform K-means and vary $K$ from 1 to 10 to identify the optimal number of clusters. Use nstart = 100. Assign your answer to a tibble object called tourism_elbow_stats that has the columns num_clusters and total_WSSD.
Afterwards, create an elbow plot to help you choose $K$. Assign your answer to an object called tourism_elbow_plot.
#DON'T CHANGE THIS SEED VALUE
set.seed(2019)
tourism_recipe <- recipe(~., data = clean_reviews) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())
k_vals <- tibble(num_clusters = 1:10)
tourism_spec_tune <- k_means(num_clusters = tune()) |>
    set_engine("stats", nstart = 100)
tourism_wkflow <- workflow() |>
    add_recipe(tourism_recipe) |>
    add_model(tourism_spec_tune) |>
    tune_cluster(resamples = apparent(clean_reviews), grid = k_vals) |>
    collect_metrics()
tourism_elbow_stats <- tourism_wkflow |>
    filter(.metric == "sse_within_total") |>
    mutate(total_WSSD = mean) |>
    select(num_clusters, total_WSSD)
tourism_elbow_stats
tourism_elbow_plot <- tourism_elbow_stats |>
    ggplot(aes(x = num_clusters, y = total_WSSD)) +
    geom_line() +
    geom_point() +
    xlab("K value") +
    ylab("Total Within Cluster \n Sum of Squares") +
    scale_x_continuous(breaks = seq(0, 10, by = 1))
tourism_elbow_plot
# your code here
#fail() # No Answer - remove if you provide an answer
cell-6606935f013bdc26
Score: 3.0 / 3.0 (Top)
test_that('Did not create an object called elbow_stats', {
expect_true(exists('elbow_stats'))
})
test_that('Did not create a plot called tourism_elbow_plot', {
expect_true(exists('tourism_elbow_plot'))
})
# The remainder of the tests were intentionally hidden so that you can practice deciding
# when you have the correct answer.
### BEGIN HIDDEN TESTS
test_that('elbow_stats is not a tibble', {
expect_true('tbl' %in% class(elbow_stats))
})
test_that('elbow_stats does not contain the correct data', {
expect_equal(dim(elbow_stats), c(10, 2))
expect_equal(digest(int_round(sum(elbow_stats$num_clusters), 2)), 'ae80db6ce6693e364a09f88dd4e3d553')
expect_equal(digest(int_round(sum(elbow_stats$total_WSSD), 2)), 'ec4cf6680f8fc2d23a8982f80c85093f')
})
test_that('nstart is not 100', {
expect_equal(digest(sum(elbow_plot$iter)), '1473d70e5646a26de3c52aa1abd85b1f')
})
properties <- c(tourism_elbow_plot$layers[[1]]$mapping, tourism_elbow_plot$mapping)
properties2 <- c(tourism_elbow_plot$layers[[2]]$mapping, tourism_elbow_plot$mapping)
test_that('tourism_elbow_plot should be a line plot with points', {
expect_true("GeomPoint" %in% c(class(tourism_elbow_plot$layers[[1]]$geom), class(tourism_elbow_plot$layers[[2]]$geom)))
expect_true("GeomLine" %in% c(class(tourism_elbow_plot$layers[[1]]$geom), class(tourism_elbow_plot$layers[[2]]$geom)))
})
test_that('num_clusters should be on the x-axis', {
expect_true(rlang::get_expr(properties$x) == 'num_clusters')
})
test_that('total_WSSD should be on the y-axis', {
expect_true(rlang::get_expr(properties$y) == 'total_WSSD')
})
test_that('Labels on the axes should be descriptive and human readable.', {
expect_false((tourism_elbow_plot$labels$y) == 'total_WSSD')
})
print('Success!')
### END HIDDEN TESTS
Question 2.2
{points: 3}
From the elbow plot above, which $k$ should you choose? Explain why you chose that $k$.
I would choose K = 2. The largest drop in total WSSD occurs going from K = 1 to K = 2; for every K beyond 2, the decrease is much smaller, so the extra clusters add little.
As mentioned earlier, a K value that is too high sub-divides the data into many small clusters, which makes the segments harder to interpret. K = 2 keeps the clustering simple while still capturing the main structure in the reviews.
Question 2.3
{points: 3}
Run K-means again, with the optimal $K$, and assign your answer to an object called reviews_clusters. Use nstart = 100. Then, use the augment function to get the cluster assignments for each point. Name the data frame cluster_assignments.
#DONT CHANGE THIS SEED VALUE
set.seed(2019)
# your code here
#fail() # No Answer - remove if you provide an answer
tourism_final_kmeans <- k_means(num_clusters = 2) |>
    set_engine("stats", nstart = 100)
reviews_clusters <- workflow() |>
    add_recipe(tourism_recipe) |>
    add_model(tourism_final_kmeans) |>
    fit(data = clean_reviews)
cluster_assignments <- augment(reviews_clusters, clean_reviews)
cluster_assignments
For the following 2 questions, use the plot below as a reference.
The visualization below is a density plot; you can think of it as a smoothed version of a histogram. Density plots are more effective than histograms for comparing multiple distributions. What we are looking for with these visualizations is which variables have different distributions between the clusters.
options(repr.plot.height = 8, repr.plot.width = 15)
cluster_assignments |>
pivot_longer(cols = -.pred_cluster, names_to = 'category', values_to = 'value') |>
ggplot(aes(value, fill = .pred_cluster)) +
geom_density(alpha = 0.4, colour = 'white') +
# We are setting the x-scale to "free" since we standardized the rating values before clustering them,
# which means that their original range (which is what we show here) does not matter
facet_wrap(facets = vars(category), scales = 'free') +
theme_minimal() +
theme(text = element_text(size = 20))
Question 2.4 Multiple Choice:
{points: 1}
From the plots above, which categories might we hypothesize are driving the clustering (i.e., are useful to distinguish between the types of tourists)? The table of categories is repeated below.
- Category 1 : Average user feedback on art galleries
- Category 2 : Average user feedback on dance clubs
- Category 3 : Average user feedback on juice bars
- Category 4 : Average user feedback on restaurants
- Category 5 : Average user feedback on museums
- Category 6 : Average user feedback on resorts
- Category 7 : Average user feedback on parks/picnic spots
- Category 8 : Average user feedback on beaches
- Category 9 : Average user feedback on theaters
- Category 10 : Average user feedback on religious institutions
A. 10, 3, 5, 6, 7
B. 10, 3, 5, 6, 1
C. 10, 3, 4, 6, 7
D. 10, 2, 5, 6, 7
Assign your answer to an object called answer2.4. Make sure your answer is an uppercase letter and is surrounded by quotation marks (e.g. "F").
# your code here
#fail() # No Answer - remove if you provide an answer
answer2.4 <- "A"
cell-96590ae632b0f9c5
Score: 1.0 / 1.0 (Top)
### BEGIN HIDDEN TESTS
test_that('Did not create an object called answer2.4', {
expect_true(exists('answer2.4'))
})
test_that('Solution is incorrect', {
expect_equal(digest(answer2.4), '75f1160e72554f4270c809f041c7a776')
})
print("Success!")
### END HIDDEN TESTS
Question 2.5
{points: 3}
Discuss one disadvantage of not being able to visualize the clusters when dealing with multidimensional data.
The biggest issue with not being able to visualize multidimensional clusters is that it is hard to see which variables contribute most to the clustering, and therefore how the resulting groups actually differ. Humans interpret data most easily through graphs; without visuals we are left comparing raw numbers, such as tables of cluster centres, which is much harder to reason about and makes it easy to miss structure (or to miss that the clusters are not meaningful at all).
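One common fallback when the clusters cannot be plotted is to compare per-cluster summary statistics instead. A minimal base-R sketch (toy ratings and cluster labels, assumed for illustration):

```r
# Toy data: two rating categories plus a cluster label for each user
reviews <- data.frame(
    art_galleries = c(1.2, 1.3, 2.8, 2.9),
    dance_clubs   = c(0.5, 0.6, 0.4, 0.5),
    cluster       = c(1, 1, 2, 2)
)

# Mean of every rating category within each cluster; variables whose means
# differ most between clusters (art_galleries here) are the ones most
# likely driving the clustering.
aggregate(. ~ cluster, data = reviews, FUN = mean)
```

This is essentially a numeric version of the density plots above: large between-cluster differences in a variable's summary statistics correspond to well-separated density curves.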
source("cleanup.R")